Due to the rapid growth of multimedia content and its visual complexity, content-based
image retrieval (CBIR) has become a very challenging task. Existing works achieve high
precision at the first retrieval levels, such as the top 10 and top 20 images, but low
precision at subsequent levels, such as the top 40, 50, and 70. The goal of this paper is
therefore to propose a new CBIR approach that achieves high precision at all retrieval
levels. The proposed method combines features extracted from the pre-trained AlexNet
model with discrete cosine transform (DCT) features. Principal component analysis (PCA)
is performed on the AlexNet features, and the combined feature vector is fed to a
multiclass support vector machine (SVM). The Euclidean distance is used to measure the
similarity between the features of the query image and those of the stored images within
the class predicted by the SVM. Finally, the most similar images are ranked and retrieved.
These techniques require considerable computational power, which may not be available on
the client machine, so the processing is performed on the cloud. Experimental results on
the benchmark Corel-1k dataset show that the proposed method achieves a high precision of
97% at all retrieval levels (top 10, 20, and 70 images) while requiring less memory than
other methods.
1. INTRODUCTION

People increasingly come into contact with large amounts of image information as a result of
the rapid development and popularisation of digital, computer, and network technology, and images
have become a common carrier for describing and storing information. Image retrieval (IR) is one
of the most popular research areas in image processing. At present, the majority of web-based image
search engines rely solely on metadata associated with images, such as keywords, tags, or descriptions.
This method is called text-based image retrieval (TBIR), and it may produce a large number of false
detections. Furthermore, manually adding keywords to the images in a huge database is laborious and
may not capture every keyword that characterises an image. As a result, the performance of these
systems is unsatisfactory.

Content-based image retrieval (CBIR) has recently become essential because it can overcome
these challenges. The main purpose of CBIR is to extract key visual features of images, such as
texture, colour, and shape, and to determine the degree of similarity among images using similarity
measures. As a result, the two most critical elements affecting CBIR efficiency are feature
representation and similarity measurement. Several low-level feature descriptors for image
representation have been proposed in the past, ranging from global features such as colour [1],
texture [2], and shape [3] to interest point detectors and local descriptors such as the histogram
of oriented gradients (HOG) [4], the scale-invariant feature transform (SIFT), and speeded-up robust
features (SURF) [6]. Image representations that rely on a single type of feature can give
unsatisfactory CBIR performance because they represent the visual content of images insufficiently.
Jabeen et al. proposed an image retrieval system that relies on the fusion of speeded-up robust
features and fast retina keypoint (SURF-FREAK) feature descriptors on the basis of the
bag-of-visual-words (BoVW) model, to overcome the semantic gap and increase image retrieval
efficiency. Elnemr [8] proposed an image retrieval system that combines the SURF and maximally
stable extremal regions (MSER) approaches. The SURF detector can recognise features such as blobs
and corners, but it is unable to detect keypoints with respect to regions. It is also sensitive to
noise, and although it is rotation and scale invariant, it is not affine invariant. MSER, on the
other hand, can detect features surrounding an object’s region but cannot detect corner or blob
features. MSER is also invariant to rotation, scaling, and affine transformation.

In recent years, machine learning algorithms have been widely used and have produced good results.
Deep learning is a significant subfield of machine learning. Deep learning techniques, specifically
convolutional neural networks (CNNs), have been widely used and have achieved great improvements in
the image processing field. A CNN is composed of several hidden layers that perform mathematical
computations on the input given by the previous layer and produce an output that is fed into the next
layer. In recent years, CNNs have improved the performance of computer vision systems, including
feature extraction [9], image classification [10]-[15], pattern recognition, and speech recognition [17].

Many researchers have used CNNs to improve CBIR and have achieved significant improvements.
Shah et al. [18] trained a deep CNN based on the AlexNet framework, an eight-layer network in which
the first five layers are convolutional and the remaining layers are fully connected. They used the
features extracted from the seventh trained layer to obtain similar images. However, CNN features
have high dimensionality, and computing the similarity between a pair of vectors with 4,096
dimensions is inefficient. Later, dimensionality reduction was proposed to address this. The work
in [19] proposed a combination of AlexNet CNN features, local binary pattern (LBP) features, and HOG
features. Principal component analysis (PCA) is used to reduce the dimension of the HOG descriptor
to 1x59. Then, the HOG-PCA and LBP feature vectors are combined to create a new handcrafted feature
vector with a dimension of 1x118. To match the dimension of the handcrafted feature vector with that
of the deep feature vector, the handcrafted feature vector is processed by PCA, and 64 of the 118
features are selected to create a handcrafted-PCA feature vector with a dimension of 1x64. Finally,
the deep feature vector and the handcrafted-PCA feature vector are combined into an efficient image
descriptor with a dimension of 1x128.

Recently, pre-trained CNN models used with a transfer learning approach have been shown to extract
effective and descriptive features from image data and to achieve high accuracy, as in [20]. Maji and
Bose proposed a new CBIR approach in which features are obtained from pre-trained deep convolutional
network models trained for a large image classification problem. Ahmed proposed CBIR systems based on
features extracted using the pre-trained CNN models ResNet18 and SqueezeNet. These pre-trained CNN
models are employed to extract two groups of features that are stored separately and later used for
online image searching and retrieval. Experimental results on the popular image dataset Corel-1k show
that the CBIR method based on ResNet18 features has an overall accuracy of 95.5% on the top 10
retrieved images. Jiang suggested a new approach for CBIR based on image feature fusion and Fisher
vector (FV) encoding. First, image blocks are used to extract low-level image content features such
as hue-saturation-value (HSV) histograms, uniform LBP, and the dual-tree complex wavelet transform
(DTCWT), whereas high-level features are extracted using the AlexNet CNN. The LBP and DTCWT features
are subjected to singular value decomposition (SVD). Second, the low-level features are merged using
normalisation and weights. Finally, after FV encoding, the fused Fisher vectors are used to quantify
the similarity of image pairs. The test results on the benchmark Corel-1k show that the accuracy on
the top 10, 12, and 20 returned images is 93.4%, 92.8%, and 91.4%, respectively.

Keisham and Neelima proposed an efficient content-based image retrieval strategy based on machine
learning (ML) algorithms. Pre-processing, multiple feature extraction, feature fusion, clustering,
and classification are the steps of the proposed deep neural network-synthetic aperture radar
(DNN-SAR) method. In the pre-processing step, a fast average peer group (FAPG) filter is used to
reduce noise. Then, features such as colour, shape, and texture are extracted, and feature vectors
are computed. Using average and weighted average approaches, all three features are combined into a
single feature. Following that, the fused features are grouped using the adaptive sunflower
optimization (SFO) method. Finally, the relevant images are retrieved using the DNN-SAR optimization
process. The mAP of DNN-SAR on Corel-1k is 93.91% on the top 10 retrieved images.

The drawback of the existing works is that they achieve good precision at the first retrieval
levels, e.g., the top 10 and top 20 retrieved images, but low precision at the remaining levels,
e.g., the top 40, 50, and 70. Our research aims to overcome this drawback. New technologies such as
cloud computing can support a successful application. Cloud computing is defined as “transferring
the process from the user’s machine to servers on the internet, and storing the user’s data to be
accessible from any location and any machine,” with software becoming a service and the user’s
computer becoming only an interface, as in [25]. The design of CBIR through cloud computing is shown
in Figure 1.
Figure 1. The design of CBIR through cloud computing


In this paper, a new CBIR method is proposed to achieve high precision at all retrieval levels
with lower computational complexity by taking advantage of cloud computing. The proposed approach
uses the pre-trained AlexNet CNN for feature extraction, followed by PCA for dimensionality
reduction, and integrates the result with the discrete cosine transform (DCT) of the entire image.
The combined features are fed to a multiclass support vector machine (SVM) for classification, and
finally the Euclidean distance is used as the similarity measure between the query and stored images
using the extracted features. This paper is organized as follows. Section 2 presents the proposed
image retrieval method. Section 3 presents results and discussion. Section 4 provides the conclusion.


2. METHOD 

In this paper a new content-based image retrieval method called deep learning content-based image 
retrieval using cloud computing (DLCBIR) is proposed in order to achieve better retrieval results for the CBIR 
system. In the rest of this section, the basic idea of DLCBIR is introduced, and then the steps of the proposed 
approach are described. A CBIR system typically has two phases, the offline phase and the online phase, which 
will be described at the end of this section. 


2.1. Basic idea 

The basic idea behind CBIR is to find similar images in a large database based on a query image.
Typically, useful features are extracted from the query and database images, and the images with a
similar set of features are retrieved. In our work, we use deep learning to extract these features
and integrate them with the DCT of the entire image. To accelerate the feature similarity
computation, we apply dimensionality reduction to the extracted features. We also use a multiclass
SVM to improve accuracy. All of these techniques require a large amount of computing power, which
may not be available on the client machine, so this processing is done on the cloud.
DLCBIR consists of six phases: i) features extraction, ii) dimensionality reduction,
iii) feature vector normalization, iv) feature vectors combination, v) multiclass classification,
and vi) similarity determination. These phases are illustrated in Figure 2 and described below.


Figure 2. Six phases of the proposed DLCBIR approach


2.2.1. Features extraction phase 


In this phase, the AlexNet CNN is used to extract features from the image dataset. AlexNet is a
variant of the CNN architecture. CNNs are a valuable research topic in the fields of machine
learning and computer vision. A CNN consists of multiple hidden layers that perform mathematical
computations on the input given by the previous layer and produce an output that is fed into the
next layer, as shown in Figure 3. A CNN differs from other neural networks in that it has
convolutional layers rather than only fully connected layers, which makes it a good model for
detecting correlations between neighbouring pixels. The training stage is typically very expensive
computationally and can take a long time to complete. Once training is complete and the classifier
has been initialised appropriately, prediction is fast and efficient.
Figure 3. The architecture of CNN


AlexNet is a CNN that has had a significant impact on machine learning, especially on the
application of deep learning to machine vision. AlexNet has already been trained on the ImageNet
dataset, which has over 15 million images and 22,000 class labels, significantly more than a typical
training dataset. When working with images of common objects from the ImageNet dataset, this can
result in a reasonably good classifier. Thus, we use the pre-trained AlexNet CNN for feature
extraction in this work.

AlexNet is composed of eight trained layers. The first five layers are convolutional, whereas the
last three layers are fully connected. To accelerate training, the rectified linear unit (ReLU) is
applied after all convolutional and fully connected layers. Dropout is used before the first and
second fully connected layers. In this phase, the images are read and resized to a fixed size
(e.g., 227x227x3). This work uses the pre-trained 7th layer for feature extraction, giving a feature
vector of length 4096 per image [18]. The AlexNet CNN first extracts features from the image dataset
and then stores the extracted features for further processing. Figure 4 shows an example of the
AlexNet architecture.
Figure 4. AlexNet architecture


2.2.2. Dimensionality reduction phase 

In this phase, to accelerate the image retrieval process and improve its performance, PCA is
applied to reduce the dimension of the features extracted from the 7th pre-trained layer (FC layer)
of the AlexNet CNN. PCA is a useful method in data analysis for reducing dimension while retaining
the maximum variance of the data. In addition, DCT is used to compress the features of the entire
image without losing too much performance. This process is described as follows.

DCT has good energy accumulation characteristics and can still maintain performance during 
dimensionality reduction [28]. The 1D discrete cosine transform X(k) of a finite sequence x(n) of data with 
length N is defined as (1). 


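$$X(k) = \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\left(\frac{\pi (2n+1) k}{2N}\right), \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where

$$\alpha(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & k \neq 0 \end{cases}$$
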

The two-dimensional transform is equivalent to a one-dimensional DCT performed along one dimension
followed by a one-dimensional DCT along the other dimension. One of the main characteristics of the
DCT is its ability to concentrate the energy of the image into a few coefficients by clustering the
high-value coefficients in the upper-left corner and the low-value coefficients toward the lower
right of the transformed image. Thus, after applying the DCT to the image, the first K significant
coefficients, taken in zigzag order starting from the upper-left corner of the transformed image,
can be used as a feature vector that represents the image with a few coefficients without losing too
much performance. The number K of coefficients to keep is determined experimentally. Keeping more
coefficients gives a higher-quality representation.
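
The DCT feature step can be illustrated with a minimal sketch. It assumes a grayscale image stored as a 2D list of floats and uses the orthonormal DCT of equation (1); the function names (dct_1d, dct_2d, zigzag_indices, dct_features) are illustrative and do not come from the paper's implementation.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT of a sequence, following equation (1)."""
    n_len = len(x)
    out = []
    for k in range(n_len):
        alpha = math.sqrt(1.0 / n_len) if k == 0 else math.sqrt(2.0 / n_len)
        total = sum(
            x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_len))
            for n in range(n_len)
        )
        out.append(alpha * total)
    return out


def dct_2d(image):
    """2D DCT: 1D DCT along every row, then along every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]  # transform each column
    return [list(r) for r in zip(*cols)]  # transpose back to height x width


def zigzag_indices(height, width):
    """(row, col) positions in zigzag order, starting at the upper-left corner."""
    def key(pos):
        r, c = pos
        s = r + c
        # Alternate the traversal direction on successive anti-diagonals.
        return (s, r if s % 2 else c)
    return sorted(((r, c) for r in range(height) for c in range(width)), key=key)


def dct_features(image, k):
    """First k DCT coefficients of the image, read in zigzag order."""
    coeffs = dct_2d(image)
    order = zigzag_indices(len(coeffs), len(coeffs[0]))
    return [coeffs[r][c] for r, c in order[:k]]


# Toy 8x8 "image"; a real input would be the resized dataset image.
image = [[float((r * 8 + c) % 7) for c in range(8)] for r in range(8)]
features = dct_features(image, k=10)  # 1x10 DCT feature vector
```

The direct double loop is quadratic in the sequence length and is meant only to mirror equation (1); a production system would normally call a fast DCT routine from a numerical library.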

PCA is a versatile technique that has been widely used with good results in various applications
such as dimensionality reduction, data compression, and feature extraction [30]. The advantages of
PCA are that it reduces the dimensionality of a data set by finding a new set of variables smaller
than the original set, retains most of the sample’s information, and helps in the classification of
data. Principal components can be identified by calculating the eigenvectors and eigenvalues of the
data covariance matrix. The PCA method is detailed as follows. Suppose we have a matrix A that
contains the term weights obtained by feature extraction techniques:


$$A = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{nm} \end{bmatrix} \qquad (2)$$

where $x_{jk}$ ($j = 1, 2, \ldots, n$; $k = 1, 2, \ldots, m$) is a term weight in the collection of
vectors, n is the number of images to be classified, and m is the number of term weights obtained
from feature extraction. The steps used by PCA to reduce the dimensionality of matrix A are as
follows:

Step 1: calculate the mean of each of the m variables in matrix A:

$$\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk} \qquad (3)$$

Step 2: calculate the covariance $S_{ik}$ of the m variables in matrix A:

$$S_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad (4)$$

where $i, k = 1, \ldots, m$. The eigenvectors and eigenvalues of the covariance matrix are then
computed, and the principal components are selected: we keep the first d < m eigenvectors, those
corresponding to the d largest eigenvalues of the covariance matrix, where d is the desired
dimension. Finally, a matrix M with dimension nxd is obtained as (5).

$$M = \begin{bmatrix} f_{11} & f_{12} & f_{13} & \cdots & f_{1d} \\ f_{21} & f_{22} & f_{23} & \cdots & f_{2d} \\ \vdots & \vdots & \vdots & & \vdots \\ f_{n1} & f_{n2} & f_{n3} & \cdots & f_{nd} \end{bmatrix} \qquad (5)$$

where $f_{ij}$ is the j-th reduced feature of image i, so that the data are reduced from the
original nxm size to nxd. The PCA algorithm is used in our work to reduce the feature vector of each
image, extracted from the 7th pre-trained layer (FC layer) of the AlexNet CNN, from 1x4096 to 1xd
(e.g., 1x64) while retaining the maximum variance of the data.
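
As an illustration of the PCA steps, the following sketch assumes NumPy is available, treats the rows of A as images, and uses the 1/n covariance of equation (4). The function name pca_reduce and the random stand-in data are hypothetical; the paper does not specify its implementation.

```python
import numpy as np


def pca_reduce(A, d):
    """Project the n x m feature matrix A onto its d leading principal components."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    mean = A.mean(axis=0)                 # equation (3): mean of each variable
    centered = A - mean
    S = centered.T @ centered / n         # equation (4): m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh suits symmetric matrices
    top = np.argsort(eigvals)[::-1][:d]   # indices of the d largest eigenvalues
    components = eigvecs[:, top]          # m x d projection matrix
    return centered @ components, mean, components  # n x d matrix M of equation (5)


# Random stand-in for the n x 4096 AlexNet feature matrix (assumption for the demo).
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 16))
reduced, mean, components = pca_reduce(features, d=4)

# A query image must be projected with the same mean and components.
query_reduced = (features[0] - mean) @ components
```

For 4096-dimensional features the covariance matrix is 4096x4096; when the number of images is much smaller than the feature dimension, a singular value decomposition of the centered data is a common alternative that yields the same components.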
2.2.3. Feature vectors normalization phase

Normalization gives equal weight to all features in a data set and is thus useful for classification
algorithms. Normalization can improve the prediction performance of a classification model, as in
[31]. The normalization is computed from the values in the vector itself. For example, consider a
vector of size 1x4: [4, 6, 9, 11]. To normalize it, we first calculate the l2-norm of the vector,
which is $\sqrt{4^2 + 6^2 + 9^2 + 11^2} \approx 15.94$. Then we divide each value of the vector by
this l2-norm, $\left[\frac{4}{15.94}, \frac{6}{15.94}, \frac{9}{15.94}, \frac{11}{15.94}\right]$,
which gives approximately [0.25, 0.38, 0.56, 0.69].
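
The worked example can be checked with a few lines of code. This is a minimal sketch; the function name l2_normalize is illustrative, and leaving an all-zero vector unchanged is an assumption, since the paper does not discuss that case.

```python
import math


def l2_normalize(vector):
    """Divide every element by the l2-norm of the vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)  # assumption: an all-zero vector is returned as-is
    return [v / norm for v in vector]


normalized = l2_normalize([4, 6, 9, 11])
print([round(v, 2) for v in normalized])  # [0.25, 0.38, 0.56, 0.69]
```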


2.2.4. Feature vectors combination phase

In this phase, the normalized feature vector produced by DCT and the feature vector produced by PCA
are combined to create the final feature vector that represents the image. For example, a DCT
feature vector with dimension 1xM (e.g., 1x10) and a PCA feature vector with dimension 1xN (e.g.,
1x64) are combined into an efficient image descriptor with a dimension of 1x(M+N) (e.g., 1x74).
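
A possible reading of this step is a plain concatenation, sketched below. It assumes that only the DCT vector is l2-normalized before concatenation, as the text of this phase states; the function name combine_features and the toy values are hypothetical.

```python
import math


def combine_features(dct_vector, pca_vector):
    """Concatenate an l2-normalized DCT vector (1xM) with a PCA vector (1xN)."""
    norm = math.sqrt(sum(v * v for v in dct_vector))
    dct_normalized = [v / norm for v in dct_vector] if norm else list(dct_vector)
    return dct_normalized + list(pca_vector)


dct_vector = [3.0, 4.0] + [0.0] * 8  # toy 1x10 DCT features
pca_vector = [0.5] * 64              # toy 1x64 AlexNet-PCA features
descriptor = combine_features(dct_vector, pca_vector)  # 1x74 image descriptor
```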


2.2.5. Multiclass classification phase 

In this phase, a multiclass SVM is used for classification to increase the accuracy of the proposed
approach at all retrieval levels. When categorizing a particular image, there are N different
classes to which the image can be assigned. Therefore, a function must be constructed that can
effectively predict the class to which the given image belongs. SVMs are primarily designed for
binary classification, that is, for only two possible classes. For more than two classes, there is
no SVM equivalent to multinomial regression. Rather, the outputs of individual two-class SVMs are
combined. There are several ways to accomplish this. Our implementation applies the
“one-against-one” approach, as shown in Figure 5: for k classes, the support vector classification
procedure is executed k(k-1)/2 times, once for each possible pair of classes. The winning class is
the one that receives the most votes among all two-class SVMs [32].


Figure 5. An example illustrating the “one-against-one” approach of a multiclass SVM
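
The voting logic of the one-against-one scheme can be sketched as follows. The pairwise classifiers here are hypothetical stand-ins (nearest of two 1D class centres) for trained two-class SVMs, and the tie-breaking rule is an arbitrary choice made for this illustration.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(sample, classes, pairwise_classifiers):
    """Majority vote over the k(k-1)/2 pairwise decisions.

    pairwise_classifiers maps each (class_a, class_b) pair to a callable that
    returns class_a or class_b for the sample.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](sample)] += 1
    # Ties are broken by the order of `classes` (arbitrary choice for this sketch).
    return max(classes, key=lambda c: votes[c])


# Toy stand-ins for trained two-class SVMs: pick the nearer 1D class centre.
centers = {"beach": 0.0, "bus": 5.0, "horse": 10.0}
classes = list(centers)


def make_classifier(a, b):
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


classifiers = {(a, b): make_classifier(a, b) for a, b in combinations(classes, 2)}
predicted = one_vs_one_predict(4.2, classes, classifiers)
```

With the 10 categories of Corel-1k, this scheme requires 10(10-1)/2 = 45 pairwise classifiers.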


2.2.6. Similarity determination phase 

In this phase, the Euclidean distance is used as the similarity measure between the query and stored
images. Euclidean distance is an appropriate measure for determining similarity because of its
popularity and simplicity of computation. The feature vector of the query image is compared, using
the Euclidean distance, with the dataset feature vectors within the class predicted by the
multiclass SVM. The relevant images are then arranged in ascending order of their Euclidean distance
to the query, so that the most similar images come first, and the top N are retrieved. The Euclidean
distance is computed as (6).

$$d(X, Y) = \sqrt{\sum_{i} (x_i - y_i)^2} \qquad (6)$$

In (6), X and Y are the feature vector of the query image and the feature vector of an image in the
database, while $x_i$ and $y_i$ are the elements of these vectors.
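
The retrieval step can be sketched as follows, assuming the feature database is a list of (image_id, class_label, feature_vector) tuples; this layout and the names used are illustrative only. Candidates are sorted by ascending distance so that the closest images come first.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def retrieve_top_n(query, predicted_class, database, n):
    """Rank the images of the predicted class by distance to the query."""
    candidates = [
        (euclidean_distance(query, feats), image_id)
        for image_id, label, feats in database
        if label == predicted_class  # search only within the predicted class
    ]
    candidates.sort()  # ascending: smallest distance = most similar
    return [image_id for _, image_id in candidates[:n]]


# Toy feature database; real descriptors would be 1x74 vectors.
database = [
    ("img_a", "bus", [0.0, 0.0]),
    ("img_b", "bus", [1.0, 1.0]),
    ("img_c", "bus", [3.0, 4.0]),
    ("img_d", "horse", [0.1, 0.1]),
]
top = retrieve_top_n([0.9, 1.0], "bus", database, n=2)
```

Restricting the search to the predicted class reduces the number of distance computations, but it also means that a misclassified query cannot retrieve images of its true class.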


A deep learning content-based image retrieval approach using cloud computing (Mahmoud S. Sayed) 


ISSN: 2502-4752


2.3. The online and offline processes of the proposed approach

The proposed CBIR framework, shown in Figure 6, includes two process modes: the online process (on
the left side of Figure 6) and the offline process (on the right side of Figure 6). During the
offline phase, a feature database is created from the features of every image in the database, and
the multiclass SVM is trained on these features. The online process, on the other hand, is driven by
the user interface: features are extracted from the query image given by the user, the trained SVM
predicts the class of the query, and the distance measure is then used to compare the query features
with the features in the database within the predicted class. These distances are sorted to rank the
images by similarity, and the top-ranked images are retrieved.


Figure 6. Overview of the online and offline processes of the proposed DLCBIR approach


3. RESULTS AND DISCUSSION
3.1. Dataset description 

The Corel-1k dataset [33] is commonly used in image retrieval and classification research.
Therefore, the performance of the proposed DLCBIR is examined using the Corel-1k dataset. The
Corel-1k dataset is composed of 10 categories, each containing 100 images with a resolution of
256x384x3 or 384x256x3 pixels. We used 70% of the images per class for training and 30% for
evaluation. Six samples of each category are shown in Figure 7.


Figure 7. Image samples from each category in the Corel-1k dataset, from left to right
3.2. Performance evaluation
In this section, the performance of the proposed DLCBIR is evaluated and compared with the existing
systems in [7], [8], [18]-[24]. The proposed DLCBIR retrieves a set of relevant images from the
dataset based on their Euclidean distance score. The performance of all methods is measured using
precision [34]. Precision quantifies the proportion of retrieved images that are relevant to the
query. It is computed as (7):

$$\text{Precision} = \frac{\text{No. relevant images retrieved}}{\text{Total No. images retrieved}} \qquad (7)$$

For a given query q, the corresponding average precision AP is calculated, and the mean of the AP
scores over all N queries, called mAP, is computed as (8).

$$\text{mAP} = \frac{1}{N} \sum_{q=1}^{N} \text{AveragePrecision}(q) \qquad (8)$$
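
The two metrics can be sketched as follows. The function names are illustrative, and averaging the per-category AP values of Table 1 as the terms of equation (8) is an assumption about how the reported mAP was obtained; under that assumption the result agrees with the 97% reported for DLCBIR.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Equation (7): fraction of the top-k retrieved images that are relevant."""
    top_k = retrieved_labels[:k]
    if not top_k:
        return 0.0
    relevant = sum(1 for label in top_k if label == query_label)
    return relevant / len(top_k)


def mean_average_precision(average_precisions):
    """Equation (8): mean of the individual average precision values."""
    return sum(average_precisions) / len(average_precisions)


# Toy query: 3 of the 4 retrieved images share the query's class.
p = precision_at_k(["bus", "bus", "horse", "bus"], "bus", k=4)

# Category-wise AP% values taken from Table 1 (top 10 retrieved images).
table1_ap = [93.33, 100, 100, 100, 100, 90, 100, 100, 90, 96.67]
print(round(mean_average_precision(table1_ap), 2))  # 97.0
```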
Table 1 shows the average precision AP of the proposed DLCBIR for each class of the Corel-1k
dataset. The proposed system achieves a high AP in every category. Figure 8 shows the confusion
matrix of the proposed DLCBIR on the Corel-1k dataset, and Figure 9 shows the five most similar
images retrieved for a query image from each category by the proposed DLCBIR on the Corel-1k dataset.


Table 1. Category-wise average precision results of DLCBIR at the top 10 retrieval images 


Categories                       AP%
Beaches                          93.33
Bus                              100
Dinosaurs                        100
Elephants                        100
Flowers                          100
Foods                            90
Horses                           100
Monuments                        100
Mountain and snow                90
People and villages in Africa    96.67

Figure 8. Confusion matrix for the proposed DLCBIR on Corel-1k dataset

Figure 9. The visual results of query image using proposed DLCBIR on Corel-1k dataset


Figure 10 compares the proposed method with other methods on the Corel-1k dataset in terms of
precision for the top 10 retrieved images. Figure 11 shows the precision for a varying number of
retrieved images on Corel-1k, where the number of retrieved images is 10, 20, ..., 70. The proposed
method shows the highest precision at all levels among the compared methods. Table 2 compares the
proposed DLCBIR with the existing methods in terms of mAP on the Corel-1k dataset. The results show
that the proposed DLCBIR achieves higher precision while requiring less memory than the existing
retrieval systems [7], [8], [18]-[24].
Figure 10. Comparison results of our proposed method with other methods for Corel-1k dataset in
terms of precision for top 10 retrieved images

Figure 11. Precision graph with varying number of retrieved images for Corel-1k


Table 2. The mAP results of DLCBIR and other methods on Corel-1k dataset at the top 10 retrieval images 


Method                    mAP%     Dimension
SURF + FREAK              86       1x128
SURF + MSER               88       1x128
AlexNet CNN               93.80    1x4096
AlexNet + HOG + LBP       95.80    1x128
ResNet50                  96.11    1x100
ResNet18                  95.50    1x512
AlexNet + LBP + DTCWT     93.4     -
DNN-SAR                   93.91    -
Proposed DLCBIR           97.00    1x74


4. CONCLUSION 

In this paper, a new algorithm called DLCBIR is proposed to retrieve similar images by taking
advantage of cloud computing. DLCBIR is based on pre-trained AlexNet CNN features reduced by PCA
and integrated with features extracted from the DCT of the entire image; after normalization, the
combined features are fed to a multiclass SVM. The features extracted from the DCT and from
AlexNet-PCA were combined because the combination gives better precision than using either of them
separately. The multiclass SVM was used to increase performance, since the similarity between the
query and stored images is measured only within the class it predicts. In addition, the Euclidean
distance was used as the similarity metric to retrieve from the database the images most similar to
the query image. The results of experiments on the Corel-1k dataset showed that DLCBIR achieves a
high precision of 97% at different retrieval levels, higher than other existing systems, for the
correctly classified and retrieved images in the test data. In future work, DLCBIR will be
implemented with parallel computation to retrieve images from large databases while decreasing the
time needed for training and for extracting features from the databases.